Shape-aware stochastic neighbor embedding for robust data visualisations

Background The t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm has emerged as one of the leading methods for visualising high-dimensional (HD) data in a wide variety of fields, especially for revealing cluster structure in HD single-cell transcriptomics data. However, t-SNE often fails to correctly represent hierarchical relationships between clusters and creates spurious patterns in the embedding. In this work we generalised t-SNE using shape-aware graph distances to mitigate some of the limitations of the t-SNE. Although many methods have been recently proposed to circumvent the shortcomings of t-SNE, notably Uniform manifold approximation (UMAP) and Potential of heat diffusion for affinity-based transition embedding (PHATE), we see a clear advantage of the proposed graph-based method. Results The superior performance of the proposed method is first demonstrated on simulated data, where a significant improvement compared to t-SNE, UMAP and PHATE, based on quantitative validation indices, is observed when visualising imbalanced, nonlinear, continuous and hierarchically structured data. Thereafter the ability of the proposed method compared to the competing methods to create faithfully low-dimensional embeddings is shown on two real-world data sets, the single-cell transcriptomics data and the MNIST image data. In addition, the only hyper-parameter of the method can be automatically chosen in a data-driven way, which is consistently optimal across all test cases in this study. Conclusions In this work we show that the proposed shape-aware stochastic neighbor embedding method creates low-dimensional visualisations that robustly and accurately reveal key structures of high-dimensional data.

Formalism of t-SNE Here we provide some mathematical details of the t-SNE method. Suppose there are n data points, the first step is to transform the distances in the HD space into a probability distribution. Specifically, a 'directed' measure of similarity from point x i to point x j in the HD space (with i, j = 1, . . . , n) is defined as a conditional probability in terms of the Gaussian kernel and the softmax function, The self similarity p i|i is set to 0. Here δ ij denotes the distance between points x i and x j , which is the conventional distance measure, e.g. ED, in the t-SNE and the BHD in the SASNE. The variable standard deviations σ j (with j = 1, · · · , n) can be fixed by choosing a constant value for the perplexity, P, defined by In Eq. (2), H(p ·|j ) denotes the Shannon entropy [1] of the probability distribution p ·|j , defined as H(p ·|j ) = − i =j p i|j log p i|j . The perplexity can vary between 1 and n − 1 and it corresponds to the effective number of neighbors around a point x j covered by the Gaussian kernel with standard deviation σ j . Points beyond the perplexity range will simply be counted as 'faraway'. When perplexity equals 1, it corresponds to the case σ j → 0 that all probability mass is placed on the nearest neighbor. On the other hand, when perplexity equals n − 1, it corresponds to the case σ j → ∞ in which all neighbors are weighted equally. The perplexity is the main hyper-parameter of t-SNE methods that needs to be determined. Moreover, the probability distribution is symmetrised as p ij = p i|j +p j|i 2n for computational convenience. Similarly, the distances in the LD embedding space are also transformed into a probability distribution in terms of the long-tailed t-distribution with one degreeof-freedom as follows Here in the LD embedding space, the ED is used for both the t-SNE and SASNE. The self-similarity q ii is again set to zero. The LD embedding coordinates y i are then obtained by minimizing the Kullback-Leibler (KL) divergence, KL(p, q) = i j =i p ij log pij qij , as a cost function between the probability distributions, p ij and q ij , using gradient-based methods. The KL divergence has the property that KL(p, q) = 0 if and only if p ij = q ij for all i and j.

t-SNE optimisation
The optimisation procedures of t-SNE are as follows: In both the t-SNE and SASNE methods, one minimises numerically the KL divergence between the probability distributions, p ij and q ij , as described above by gradient descent. Since the cost function is not convex, the optimisation may converge to a local minimum and therefore the solution may depend on the initialisation, i.e., the initial configuration of the coordinates y i with i = 1, · · · , n in the LD space.
In case of optimising t-SNE, we follow the protocol of Kobak and Berens [2] that the optimisation is initialised with the two leading principal components of the HD data set, normalised by the standard deviation of the corresponding principal component. The initial configuration is further multiplied by a factor of 10 −4 which was shown empirically to speed up the convergence. For the SASNE optimisation, we use a similar initialisation procedure as in the case of the t-SNE but apply it to the BHD.
We also adopted the optimisation trick to multiply all HD probabilities p ij by a constant α = 12, called early exaggeration, for the first 250 iterations. which was shown to lead to better cluster separation [3]. Moreover, as originally suggested by Belkina et al. [4], the learning rate in the gradient descent is set to η = n/α, where n is the number of points, which has shown to lead to improved convergence behaviour in terms of stability and speed. Given the above settings, the optimisation was performed by the tsne function provided by the MATLAB Statistics and Machine Learning Toolbox.
Hyper-parameters of PHATE and UMAP Default parameters for PHATE are used as suggested in the original publication [5] and for UMAP we follow the default parameters used by Becht et al. [6].

Implementation All methods are implemented in MATLAB version 2022a
Computing the biharmonic distance Given a graph G defined by a n × n similarity matrix W with elements w ij = 1/ x i − x j 2 , one can compute the graph Laplacian . Here D is the diagonal degree matrix with elements d i = k w ik that is the degree of the node i. The BHD between the points x i and x j can be expressed in terms of the eigendecomposition of L as , where λ k is the kth eigenvalue, v ik is the ith element of the kth eigenvectors of L, and and Vol(G) = i d i is the volume of the graph G. This expression also shows that the BHD has the form of an ED, i.e., sum of squares n k=2 (z 2 ik − z 2 jk ), with the (n − 1)D Euclidean coordinates for the ith data point given by z ik = v ik / Vol(G)/λ 2 k (k = 2, · · · , n). A convenient property of these coordinates are that the corresponding covariance matrix is diagonal. Therefore the PCA initialisation based on the BHD is simply the leading coordinates with largest corresponding eigenvalues, in this Euclidean space. Hence, after computation of the eigendecomposition of L, these coordinates can be directly input to any standard t-SNE implementation where we keep optimisation scheme consistent with the original t-SNE algorithm.
In this study we computed the BHD using the symmetric Laplacian L sym = D − 1 2 LD − 1 2 [7], instead of L, since this resulted empirically in better results based on the RRP. The computation of the graph distance is now based on the modified formula compared to the one above, given by Here v ·k and λ k denotes the eigenvector and eigenvalue of L sym respectively. Notice that the distance given by C ij = n k=2 (z Pre-processing of single-cell data We follow the same pre-processing procedures as used by Kobak et al. [2] as follows: Let n c and n g be, respectively, the number of cells and the number of genes under consideration. We denote x ig as the expression level of gene g (g = 1, · · · , n g ) in cell i (i = 1, · · · , n c ). The fraction of cells that do not express the gene g is given by d g = 1 nc nc i=1 I(x ig = 0), where the indicator function I(x ig = 0) is equal to 1 when x ig = 0, and 0 otherwise. Furthermore, the mean logexpression level of the gene g can be expressed as m g = 1 is the number of cells with non-zero expression of gene g. The next step adopts a heuristic approach from the work of Kobak et al. [2] to select 1000 genes by finding a value of b such that there are exactly 1000 genes that exhibit high fraction of zero-expression levels across cells in relation to its mean expression value, which has shown to be able to select biologically relevant genes [8]. Mathematically, this is done by finding a value b such that exactly 1000 genes satisfying the relation d g > exp − 3 2 (m g − b) + 0.02 can be selected. The coefficient 3 2 and 0.02 are chosen for a good distributional fit [2]. This selected subset of 1000 genes is then kept for the analysis, whereas the others are discarded. Finally, the log(1 + x ig ) transformation is applied to the counts of the 1000 selected genes to even out the variance of the larger expression levels. That is, the relative expression difference is considered as opposed to the absolute difference so that, for example, an expression difference from 1 to 5 is considered equal to the difference between 100 and 500.
Author details